Principal Components Analysis Using Averaged Eigenvector Centrality Over Secondary Structures¶
# Import necessary packages to perform the analysis demonstrated below.
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from matplotlib_venn import venn2,venn2_circles
from matplotlib import pyplot as plt
import csv
from plotly.subplots import make_subplots
import plotly.graph_objects as go
Preliminary inspection of the data shows that there are differences in how eigenvector centrality is distributed across the primary structures of LRH-1 (See multiplot below).
Although this is readily observable on inspection, what is needed is a rigorous approach that demonstartes that these differences are categorical.
The principal Components analysis demonstrates a clear grouping.
fig, axs = plt.subplots(5,4,figsize=(30,20),sharey=True, sharex=True)
pdbList=[['1yok','1yuc','1zdu','3plz'],['3tx7','4dor','4dos','4oni'],['4ple','4rwv','6vc2','NA'],
['4pld','5l11','5syz','5unj'],['6oqx','6oqy','6or1','NA']]
path = "/Users/davidfoutch/Desktop/David/"
for j in range(5):
for k in range(4):
X = []
Y = []
if pdbList[j][k] != 'NA':
i=pdbList[j][k]
pathToFile=path+i+"/eigen_"+i+"_plt"
with open(pathToFile, 'r') as file:
plotting = csv.reader(file, delimiter='\t')
for ROWS in plotting:
X.append(int(ROWS[0]))
Y.append(float(ROWS[1]))
axs[j,k].plot(X, Y, color="black", label=i)
axs[j,k].legend(fontsize="12")
# plt.xticks(np.arange(210, 485, 25))
# # plt.title('')
fig.text(0.075, 0.5, 'Eigenvector Centrality', va='center', rotation='vertical',fontsize=18)
fig.text(0.5, 0.055, 'Residue Position', ha='center',fontsize=18)
# plt.savefig('nr5a1_plts.png')
Text(0.5, 0.055, 'Residue Position')
Eiegenvector Centrality Across LRH-1 Primary Structure. The x-axis plots the physical sequence (or, primary structure) of each residue of the protein. The y-axis plots the eigenvector centrality value for each residue in the sequence. On visual inspection it seems reasonable that these these protein systems may be clustered into two groups. It also seems reasonable that principal components analysis PCA)is a reasonable approach for investigating what features are driving this variance.
Secondary Protein Structures as Features¶
As indicated in the illustration below, the LRH-1 secondary structures were identified. Each of the structures were assigned an average eigenvector centrality value based on the eigenvector values of each residue in the structure. This was performed for each of the 18 LRH-1 structures under consideration.
Secondary Structure as Features. The LRH-1 protein is a sequence of 242 residues. To purpose of this PCA is to find some measure of variation in the eigenvector values that might reproduce the visual grouping of the 18 structures. The approach we adopted was to identify the primary structure (or, sequence) associated with each of the secondary structures and average the eigenvector values for each residue within each secondary structure. This figure illustrates how each secondary structure (loops, helices, and sheets) was treated as a group of residues. Here the protein structures associated with each PDB file are the targets and the secondary structures are the features (See table below).
Protein Structures as Targets (rows) and Secondary Structures as Features (columns)¶
#Load the table as a Pandas dataframe.
"""Note: That the 'pdb_file' entries contain the list of 'targets' used in the PCA described below.
The column names are the list of secondary structures of LRH-1. Each (i,j) entry is the average EC value
for the secondary structure associated with the PDB file of that row."""
df=pd.read_csv('/Users/davidfoutch/Desktop/David/secondary_structure_pca.txt',sep='\t',header=0)
df.style.set_table_attributes('style="font-size: 16px"')
| pdb_file | helix1 | helix2 | loop1 | helix3 | helix4 | helix5 | sheet1 | loop2 | sheet2 | helix6 | helix7 | helix8 | loop3 | helix9 | loop4 | helix10 | helix11 | loop5 | helix12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1yok | 0.000576 | 0.003654 | 0.000314 | 0.031029 | 0.001018 | 0.052709 | 0.150762 | 0.053812 | 0.060763 | 0.026101 | 0.042471 | 0.004820 | 0.000790 | 0.000157 | 0.000023 | 0.000498 | -0.029088 | 0.004879 | 0.061481 |
| 1 | 1yuc | 0.000743 | 0.012840 | 0.000890 | 0.057354 | 0.001669 | 0.060717 | 0.056455 | 0.076121 | 0.018289 | 0.008388 | 0.008053 | 0.005091 | 0.003343 | 0.000391 | 0.000141 | 0.001046 | -0.048620 | 0.027062 | 0.039156 |
| 2 | 1zdu | 0.000202 | 0.007947 | 0.000460 | 0.050669 | 0.002127 | 0.050298 | 0.039572 | 0.072325 | 0.017996 | 0.018922 | 0.018822 | 0.002609 | 0.001172 | 0.000081 | 0.000011 | 0.000377 | -0.050530 | 0.020620 | 0.072636 |
| 3 | 3plz | 0.001271 | 0.015656 | 0.000000 | 0.055139 | 0.004365 | 0.054102 | 0.127785 | 0.116407 | 0.011096 | 0.025060 | 0.021333 | 0.004178 | 0.000761 | 0.001090 | 0.000129 | 0.000836 | -0.015597 | 0.023692 | 0.037046 |
| 4 | 3tx7 | 0.000625 | 0.005513 | 0.000427 | 0.044520 | 0.000000 | 0.043169 | 0.104700 | 0.075048 | 0.072745 | 0.061702 | 0.030444 | 0.001431 | 0.000022 | 0.000218 | 0.000203 | 0.000574 | -0.025690 | 0.020532 | 0.054883 |
| 5 | 4dor | 0.002067 | 0.010955 | 0.001888 | 0.067321 | 0.004443 | 0.058594 | 0.054307 | 0.080375 | 0.018431 | 0.016435 | 0.029738 | 0.007632 | 0.000997 | 0.000904 | 0.000169 | 0.001584 | -0.028626 | 0.028174 | 0.040670 |
| 6 | 4dos | 0.000402 | 0.028617 | 0.005939 | 0.067000 | 0.000911 | 0.038745 | 0.059149 | 0.065611 | 0.060590 | 0.048827 | 0.022908 | 0.002705 | 0.000157 | 0.000440 | 0.000227 | 0.001477 | 0.034294 | 0.036000 | 0.061556 |
| 7 | 4oni | 0.000312 | 0.008313 | 0.000734 | 0.033683 | 0.000433 | 0.057376 | 0.145752 | 0.071987 | 0.021481 | 0.015991 | 0.030568 | 0.001800 | 0.001175 | 0.000278 | 0.000082 | 0.000854 | -0.044903 | 0.017223 | 0.018641 |
| 8 | 4pld | 0.019235 | 0.030206 | 0.007326 | 0.081821 | 0.029528 | 0.085718 | 0.076756 | 0.061078 | 0.019302 | 0.014691 | 0.054611 | 0.056935 | 0.008931 | 0.023157 | 0.004250 | 0.023564 | -0.070372 | 0.014760 | 0.064728 |
| 9 | 4ple | 0.001660 | 0.002323 | 0.001876 | 0.045064 | 0.001737 | 0.046692 | 0.049928 | 0.015369 | 0.048707 | 0.063348 | 0.031489 | 0.005911 | 0.001757 | 0.000692 | 0.000124 | 0.000884 | -0.046582 | 0.030244 | 0.069936 |
| 10 | 4rwv | 0.000914 | 0.003979 | 0.000283 | 0.043673 | 0.001150 | 0.043446 | 0.059859 | 0.053475 | 0.037659 | 0.042981 | 0.027650 | 0.003051 | 0.001014 | 0.000204 | 0.000038 | 0.000390 | -0.053159 | 0.010822 | 0.074276 |
| 11 | 5l11 | 0.015617 | 0.033968 | 0.012121 | 0.087246 | 0.020994 | 0.090050 | 0.045139 | 0.044618 | 0.016428 | 0.025267 | 0.059047 | 0.058966 | 0.012628 | 0.021853 | 0.007340 | 0.035950 | -0.058355 | 0.012441 | 0.040010 |
| 12 | 5syz | 0.014761 | 0.029265 | 0.010711 | 0.085142 | 0.024236 | 0.094928 | 0.075324 | 0.050406 | 0.017226 | 0.017767 | 0.050440 | 0.057061 | 0.008488 | 0.020446 | 0.007410 | 0.032384 | -0.050732 | 0.012630 | 0.053846 |
| 13 | 5unj | 0.018808 | 0.020364 | 0.001765 | 0.096289 | 0.025358 | 0.088195 | 0.041148 | 0.051062 | 0.011595 | 0.021755 | 0.047324 | 0.058565 | 0.003736 | 0.020208 | 0.002930 | 0.021128 | -0.039532 | 0.023337 | 0.063663 |
| 14 | 6oqx | 0.026081 | 0.020759 | 0.005134 | 0.083458 | 0.041482 | 0.096881 | 0.030195 | 0.034281 | 0.010749 | 0.018421 | 0.047028 | 0.086320 | 0.018635 | 0.038344 | 0.009920 | 0.045036 | -0.053163 | 0.012589 | 0.053940 |
| 15 | 6oqy | 0.013663 | 0.026514 | 0.004330 | 0.087202 | 0.020875 | 0.086448 | 0.064650 | 0.067561 | 0.023111 | 0.030738 | 0.052478 | 0.051809 | 0.011350 | 0.018312 | 0.005780 | 0.027102 | -0.058918 | 0.011047 | 0.057732 |
| 16 | 6or1 | 0.032763 | 0.021224 | 0.003412 | 0.079031 | 0.029667 | 0.083683 | 0.027241 | 0.040638 | 0.010503 | 0.018645 | 0.044225 | 0.101694 | 0.011849 | 0.046109 | 0.012700 | 0.056007 | -0.060972 | 0.011454 | 0.056959 |
| 17 | 6vc2 | 0.000540 | 0.015342 | 0.000711 | 0.044759 | 0.001578 | 0.053363 | 0.145899 | 0.115398 | 0.021781 | 0.017204 | 0.018100 | 0.004535 | 0.000440 | 0.000243 | 0.000037 | 0.000387 | -0.006567 | 0.007745 | 0.010284 |
Table of Targets and Features. The rows in the table above represents the targets for the PCA, which are the individual LRH-1 structures. The columns represent features, which are the secondary structures. The values in the table are the averaged eigenvector centrality values for each secondary target.
Inspecting PC1 and PC2¶
df=pd.read_csv('/Users/davidfoutch/Desktop/David/transpose_for_pca.txt',sep='\t',header=0)
df=df.drop(['seq_pos','seq_id','sec_struct'], axis=1)
scaler = StandardScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)
pca = PCA(n_components=2)
PC_scores = pd.DataFrame(pca.fit_transform(df_scaled),
columns = ['PC 1', 'PC 2'])
loadings = pd.DataFrame(pca.components_.T, columns=['PC1', 'PC2'],
index=list(df.columns.values))
loadings.style.set_table_attributes('style="font-size: 16px"')
| PC1 | PC2 | |
|---|---|---|
| 1yok | 0.230914 | -0.181243 |
| 1yuc | 0.250479 | -0.148688 |
| 1zdu | 0.252150 | -0.204559 |
| 3plz | 0.250873 | -0.165109 |
| 3tx7 | 0.228274 | -0.262484 |
| 4dor | 0.262657 | -0.129846 |
| 4dos | 0.024133 | 0.314485 |
| 4oni | 0.234650 | -0.186726 |
| 4pld | 0.260003 | 0.169378 |
| 4ple | 0.218371 | -0.232409 |
| 4rwv | 0.231815 | -0.252638 |
| 5l11 | 0.253263 | 0.244395 |
| 5syz | 0.258122 | 0.243177 |
| 5unj | 0.254570 | 0.215808 |
| 6oqx | 0.217383 | 0.383928 |
| 6oqy | 0.273277 | 0.157543 |
| 6or1 | 0.208507 | 0.402805 |
| 6vc2 | 0.225098 | -0.099870 |
Generating the PCA Two Component Biplot¶
# The 'targets' are the protein structures represented by the four character name of each PDB file
# and the 'features' are the secondary structures (helices, strands, sheets, loops) associated with LRH-1
targets = ['1yok','1yuc','1zdu','3plz','3tx7','4dor','4dos','4oni','4pld','4ple','4rwv',
'5l11','5syz','5unj','6oqx','6oqy','6or1','6vc2']
features = ['helix1', 'helix2', 'loop1', 'helix3', 'helix4', 'helix5', 'sheet1',
'loop2', 'sheet2', 'helix6', 'helix7', 'helix8', 'loop3', 'helix9',
'loop4', 'helix10', 'helix11', 'loop5', 'helix12']
df=pd.read_csv('/Users/davidfoutch/Desktop/David/secondary_structure_pca.txt',sep='\t',header=0)
df2=df.drop(columns=['pdb_file'])
x = df2.loc[:,features].values
y = df['pdb_file']
y = y.to_frame()
x = StandardScaler().fit_transform(x)
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDF = pd.DataFrame(data = principalComponents,
columns = ['PC 1', 'PC 2'])
finalDF = pd.concat([principalDF, y], axis=1)
scalePC1 = 1.0/(finalDF['PC 1'].max() - finalDF['PC 1'].min())
scalePC2 = 1.0/(finalDF['PC 2'].max() - finalDF['PC 2'].min())
ldngs = pca.components_
featuresList = list(df2.columns.values)
fig, ax = plt.subplots(1,1,figsize=(16,9))
# ax = fig.add_subplot(1,1,1)
ldngs = pca.components_
featuresList = list(df2.columns.values)
for i, feature in enumerate(featuresList):
ax.arrow(0, 0, ldngs[0, i],
ldngs[1, i], alpha=0.1)
ax.text(ldngs[0, i] * 1.0,
ldngs[1, i] * 1.0,
feature, fontsize=12, alpha=0.5)
ax.set_xlabel('Principal Component 1', fontsize = 16)
ax.set_ylabel('Principal Component 2', fontsize = 16)
ax.set_title('Two Component Biplot PCA', fontsize = 20)
set9 = finalDF[finalDF['pdb_file'].isin(['1yok','1yuc','1zdu','3tx7','3plz','4dor', '4dos','4oni','4ple','4rwv','6vc2'])]
set10 = finalDF[finalDF['pdb_file'].isin(['4pld','5l11','5syz','5unj','6oqx','6oqy','6or1'])]
ax.scatter(set9['PC 1']*scalePC1, set9['PC 2']*scalePC2,color='red',marker='x',label='Small molecule')
ax.scatter(set10['PC 1']*scalePC1, set10['PC 2']*scalePC2,color='blue',label='Phospholipid')
ax.legend(loc='upper right',fancybox=True,fontsize=14)
ax.grid()
for i in range(11):
plt.text(set9.iloc[i,0]*scalePC1, set9.iloc[i,1]*scalePC2+0.01, set9.iloc[i,2], color='red',fontsize = 14)
for i in range(7):
plt.text(set10.iloc[i,0]*scalePC1, set10.iloc[i,1]*scalePC2-0.03, set10.iloc[i,2], color='blue',fontsize = 14)
# plt.savefig('/Users/davidfoutch/Desktop/David/new_two_component_biplot_pca.png')
Review of the Result and Interpretation¶
There are two key results that offer insight into why these two groups of proteins may be sorted by their energetic equilibrium, or $\Delta\Delta$G (see Fig 2). First, when the average eigenvector centrality of each secondary structure (helices, sheets, and loops) for each protein was used as features in the principal components analysis (PCA) the low $\Delta\Delta$G and the high $\Delta\Delta$G proteins clustereded nicely into two groups (see Fig 2A). What this indicates is that there is a set of secondary structures that are creating the variance between the equilibrium states. The bi-plot of the PCA indicated that helix 6 and helix 3 are significantfactors to understanding this variance between the two groups. Second, on inspection of the set of edges unique to the low $\Delta\Delta$G PSNs and high $\Delta\Delta$G PSNs. It is observed that there is a greater degree of connectivity between helices 3 and 6 at the enzymatic site indicating that the opening is more restricted requiring greater energy to achieve small molecule binding to reach energetic equilibrium (see Fig 2B).